Abstract

This paper investigates how to extract objects-of-interest without relying on hand-crafted features
and sliding-window approaches. We aim to jointly solve two subtasks: (i) rapidly localizing salient
objects in images, and (ii) accurately segmenting the objects based on the localizations. We present
a general joint task learning framework, in which each task (either object localization or object
segmentation) is tackled via a multi-layer convolutional neural network, and the two networks work
collaboratively to boost performance. In particular, we propose to incorporate latent variables that
bridge the two networks in a joint optimization manner. The first network directly predicts the
positions and scales of salient objects from raw images, and the latent variables adjust the object
localizations to feed the second network, which produces pixelwise object masks. An EM-type method
is presented for the optimization, iterating between two steps: (i) using the two networks, it
estimates the latent variables with an MCMC-based sampling method; (ii) it optimizes the parameters
of the two networks jointly via back propagation, with the latent variables fixed. Extensive
experiments suggest that our framework significantly outperforms other state-of-the-art approaches
in both accuracy and efficiency (e.g., 1000 times faster than competing approaches).


1 Introduction 

A typical vision problem comprises several subproblems, which are often tackled jointly to achieve
better performance. In this paper, we focus on a general joint task learning framework based on deep
neural networks, and demonstrate its effectiveness and efficiency on generic (i.e.,
category-independent) object extraction.

Generally speaking, object extraction comprises two sequential subtasks: rapidly localizing the
objects-of-interest in images, and then generating segmentation masks based on the localizations.
Despite acknowledged progress, previous approaches often tackle these two tasks independently, and
most of them apply sliding windows over all image locations and scales [?], which could limit their
performance. Recently, several works [?] exploited the interdependencies of object localization and
segmentation, and showed promising results. For example, Yang et al. [33] introduced a joint
framework for object segmentation, in which the segmentation benefits from the object detectors and
the object detections are consistent with the underlying segmentation of the image. However, these
methods still rely on exhaustive search to localize objects. On the other hand, deep learning
methods have achieved superior capabilities in classification [?] and representation learning [4],
and they also demonstrate good potential on several complex vision tasks [29, 30, 20, 25].
Motivated by these works, we build a deep learning architecture to jointly solve the two subtasks in
object extraction, in which each task (either object localization or object segmentation) is tackled
by a multi-layer convolutional neural network. Specifically, the first network (i.e., the
localization network) directly predicts the positions and scales of salient objects from raw images,
upon which the second network (i.e., the segmentation network) generates the pixelwise object masks.






Figure 1: Motivation of introducing latent variables in object extraction. Treating predicted object
localizations (the dashed red boxes) as the inputs for segmentation may lead to unsatisfactory
segmentation results, and we can make improvements by enlarging or shrinking the localizations (the
solid blue boxes) with the latent variables. Two examples are shown in (a) and (b), respectively.


Rather than being simply stacked up, the two networks are collaboratively integrated with latent
variables to boost performance. In general, two networks optimized for different tasks might have
inconsistent interests. For example, the object localizations predicted by the first network may
cover incomplete object (foreground) regions or include a lot of background, which may lead to
unsatisfactory pixelwise segmentation. This observation is illustrated in Fig. 1, where we can
obtain better segmentation results by enlarging or shrinking the input object localizations (denoted
by the bounding boxes). To overcome this problem, we propose to incorporate latent variables between
the two networks that explicitly indicate the adjustments of the object localizations, and to
optimize them jointly with the network parameters. It is worth mentioning that our framework can be
extended to other joint task applications in similar ways. For conciseness, we use the term
“segmentation reference” to denote the predicted object localization plus the adjustment in the
following.

To train the framework, we present an EM-type algorithm, which alternately estimates the latent
variables and learns the network parameters. The latent variables are treated as intermediate
auxiliary variables during training: we search for the optimal segmentation reference, and tune the
two networks accordingly. The latent variable estimation is, however, non-trivial in this work, as
it is intractable to analytically model the distribution of the segmentation reference. To avoid
exhaustive enumeration, we design a data-driven MCMC method to effectively sample the latent
variables, inspired by [?]. In sum, the training algorithm iterates between two steps: (i) fixing
the network parameters, we estimate the latent variables and determine the optimal segmentation
reference with the sampling method; (ii) fixing the segmentation reference, the segmentation network
is tuned according to the pixelwise segmentation errors, while the localization network is tuned by
taking the adjustment of the object localizations into account.


2 Related Work 

As it extracts pixelwise objects-of-interest from an image, our work is related to salient
region/object detection [?]. These methods mainly focus on feature engineering and graph-based
segmentation. For example, Cheng et al. [9] proposed a regional contrast based saliency extraction
algorithm and further segmented objects by applying an iterative version of GrabCut. Some approaches
[22, 27] trained object appearance models and utilized spatial or geometric priors to address this
task. Kuettel et al. [22] proposed to transfer segmentation masks from training data into testing
images by searching and matching visually similar objects within sliding windows. Other related
approaches [28, 7] simultaneously processed a batch of images for object discovery and
co-segmentation, but they often required category information as priors.

Recently resurgent deep learning methods have also been applied to object detection and image
segmentation [?]. Among these works, Sermanet et al. [29] detected objects by training
category-level convolutional neural networks. Ouyang et al. [25] proposed to combine multiple
components (e.g., feature extraction, occlusion handling, and classification) within a deep
architecture for human detection. Huang et al. [20] presented multiscale recursive neural networks
for robust image segmentation. These methods generally achieve impressive performance, but they
usually rely on sliding detection windows over the scales and positions of testing images. Very
recently, Erhan et al. [?] adopted neural networks to recognize object categories while predicting
potential object localizations without exhaustive enumeration. This work inspired us to design the
first network to localize objects. To the best of our knowledge, our framework is the first to
optimize different tasks collaboratively by introducing latent variables together with network
parameter learning.


3 Deep Architecture 

In this section, we introduce a joint deep model for object extraction (i.e., extracting the
segmentation mask for a salient object in the image). Our model comprises two convolutional neural
networks, a localization network and a segmentation network, as shown in Fig. 2. Given an image as
input, our first network generates a 4-dimensional output, which specifies the salient object
bounding box (i.e., the object localization). With the localization result, our segmentation network
produces an m x m binary mask for segmentation in its last layer. Both networks are built from
convolutional layers, max-pooling operators and full connection layers. In the following, we give
the detailed definitions of these two networks.



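[Figure 2 diagram: Localization Network (five convolutional layers, three full connection layers) followed by Segmentation Network (four convolutional layers, one full connection layer), with 50 x 50 outputs.]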


Figure 2: The architecture of our joint deep model. It is stacked up by two convolutional neural 
networks: localization network and segmentation network. Given an image, we can generate its 
object bounding box and further extract its segmentation mask accordingly. 

Localization Network. The architecture of the localization network contains eight layers: five
convolutional layers and three full connection layers. For the parameter settings of the first seven
layers, we refer to the network used by Krizhevsky et al. [?]. It takes an image of size 224 x 224 x
3 as input, where the dimensions represent height, width and number of channels. The eighth layer of
the network contains 4 output neurons, indicating the coordinates $(x_1, y_1, x_2, y_2)$ of a salient
object bounding box. Note that the coordinates are normalized w.r.t. the image dimensions into the
range 0 to 224.

Segmentation Network. Our segmentation network is a five-layer neural network with four
convolutional layers and one full connection layer. To simplify the description, we denote
C(k, h x w x c) as a convolutional layer, where k represents the number of kernels, and h, w, c
represent the height, width and number of channels of each kernel. We also denote FC as a full
connection layer, RN as a response normalization layer, and MP as a max-pooling layer. The size of
the max-pooling operator is set to 3 x 3 and the pooling stride is 2. The network architecture can
then be described as:
C(256, 5 x 5 x 3) - RN - MP - C(384, 3 x 3 x 256) - C(384, 3 x 3 x 384) - C(256, 3 x 3 x 384) - MP - FC.
Taking an image of size 55 x 55 x 3 as input, the segmentation network generates a binary mask of
size 50 x 50 as the output of its full connection layer.
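
As a quick consistency check on the architecture string above, the following sketch tabulates the kernel weight count k * h * w * c of each convolutional layer and verifies that each layer's channel depth equals the kernel count of the layer before it. It uses only the sizes stated in the text; the names `CONV_LAYERS`, `conv_weight_counts` and `channels_are_consistent` are illustrative, and biases and the FC layer are left out because their sizes are not given.

```python
# Convolutional layers of the segmentation network as (k, h, w, c),
# copied from the C(k, h x w x c) notation in the text.
CONV_LAYERS = [
    (256, 5, 5, 3),
    (384, 3, 3, 256),
    (384, 3, 3, 384),
    (256, 3, 3, 384),
]


def conv_weight_counts(layers):
    """Number of kernel weights per layer (k * h * w * c), biases excluded."""
    return [k * h * w * c for k, h, w, c in layers]


def channels_are_consistent(layers, input_channels=3):
    """Each layer's channel depth c must equal the previous layer's kernel count k.

    RN and MP layers are ignored here because they do not change the channel count.
    """
    expected = input_channels
    for k, _, _, c in layers:
        if c != expected:
            return False
        expected = k
    return True


counts = conv_weight_counts(CONV_LAYERS)
print(counts, sum(counts))
print(channels_are_consistent(CONV_LAYERS))
```

The check passes for the listed layers (3 -> 256 -> 384 -> 384 -> 256 channels), which supports reading the string as a plain feed-forward stack.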

We then introduce the inference process for object extraction. Formally, we define the parameters
of the localization network and segmentation network as $\omega^l$ and $\omega^s$, respectively.
Given an input image $I_i$, we first resize it to 224 x 224 x 3 as the input for the localization
network. The output of this network via forward propagation is represented as $F_{\omega^l}(I_i)$,
which gives a 4-dimensional vector $b_i$ for the salient object bounding box. We crop the image data
for the salient object according to $b_i$, and resize it to 55 x 55 x 3 as the input for the
segmentation network. By performing forward propagation, the output of the segmentation network is
represented as $F_{\omega^s}(I_i, b_i)$, which is a vector with 50 x 50 = 2500 dimensions,
indicating the binary segmentation result for object extraction.
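
The inference steps above can be sketched as a small pipeline. This is a minimal illustration under several assumptions, not the authors' implementation: the two networks are passed in as plain callables (`localize`, `segment`), a single-channel nested list stands in for the three-channel image, nearest-neighbour resampling is used because the resizing method is not specified, the crop is taken from the 224 x 224 resized image (since the box coordinates are normalized to that range), and the 0.5 threshold is the one the paper reports for testing.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]  # single-channel stand-in for an H x W x 3 image


def resize_nearest(img: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbour resize (assumed; the paper does not name a method)."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def crop(img: Image, box: Sequence[float]) -> Image:
    """Crop img to box = (x1, y1, x2, y2), clamped so the crop is never empty."""
    h, w = len(img), len(img[0])
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    x1, y1 = max(0, min(x1, w - 1)), max(0, min(y1, h - 1))
    x2, y2 = max(x1 + 1, min(x2, w)), max(y1 + 1, min(y2, h))
    return [row[x1:x2] for row in img[y1:y2]]


def extract_object(
    img: Image,
    localize: Callable[[Image], Sequence[float]],
    segment: Callable[[Image], Sequence[float]],
    threshold: float = 0.5,
) -> Tuple[Sequence[float], List[List[int]]]:
    """Localize, crop, segment, and threshold into a 50 x 50 binary mask."""
    net_in = resize_nearest(img, 224, 224)
    box = localize(net_in)                      # b_i: four values in [0, 224]
    patch = resize_nearest(crop(net_in, box), 55, 55)
    probs = segment(patch)                      # 2500 per-pixel foreground scores
    if len(probs) != 50 * 50:
        raise ValueError("segmentation output must have 2500 entries")
    mask = [[1 if probs[r * 50 + c] > threshold else 0 for c in range(50)]
            for r in range(50)]
    return box, mask
```

With trained networks plugged in for `localize` and `segment`, one call of `extract_object` corresponds to one forward pass through each network.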

4 Learning Algorithm 

We propose a joint deep learning approach to estimate the parameters of two networks. As the 
object bounding boxes indicated by groundtruth object mask might not provide the best references 
for segmentation, we embed this domain-specific prior as latent variables in learning. We adjust the 
object bounding boxes via the latent variables to mine optimal segmentation references for training. 
For optimization, an EM-type algorithm is proposed to iteratively estimate the latent variables and 
the model parameters. 

4.1 Objective Formulation 

Suppose a set of N training images is $I = \{I_1, \ldots, I_N\}$, and the segmentation masks for the
salient objects in them are $Y = \{Y_1, \ldots, Y_N\}$. For each $Y_i$, we use $Y_i^j$ to represent
its jth pixel; $Y_i^j = 1$ indicates the foreground, while $Y_i^j = 0$ indicates the background.
According to the given object masks Y, we can obtain the object bounding boxes tightly around them
as $L = \{L_1, \ldots, L_N\}$, where $L_i$ is a 4-dimensional vector representing the coordinates
$(x_1, y_1, x_2, y_2)$. For each sample, we introduce a latent variable $\Delta L_i$ as the
adjustment for $L_i$. We name the adjusted bounding box the segmentation reference, which is
represented as $\tilde{L}_i = L_i + \Delta L_i$. The learning objective is defined by maximum
likelihood estimation (MLE) of the joint probability:

$$P(Y, \tilde{L}, I \mid \omega^l, \omega^s) = P(\tilde{L}, I \mid \omega^l)\, P(Y, I \mid \omega^s, \tilde{L}), \tag{1}$$

where the objective is decoupled into two probabilities that are bridged by the segmentation
reference $\tilde{L} = \{\tilde{L}_1, \ldots, \tilde{L}_N\}$. Specifically,
$P(\tilde{L}, I \mid \omega^l)$ represents the object localization predicted by the first network,
and $P(Y, I \mid \omega^s, \tilde{L})$ represents the segmentation estimation.
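
To make the definitions of the tight bounding box and the segmentation reference concrete, here is a minimal sketch. The function names and the inclusive pixel-coordinate convention are assumptions made for illustration; the paper only states that the box is tight around the groundtruth mask and that the adjustment is added to it.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, ...]  # (x1, y1, x2, y2)


def tight_bbox(mask: List[List[int]]) -> Box:
    """Smallest box enclosing all foreground (== 1) pixels, inclusive coordinates."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask has no foreground pixels")
    return (cols[0], rows[0], cols[-1], rows[-1])


def segmentation_reference(box: Sequence[int], delta: Sequence[int]) -> Box:
    """Segmentation reference: the box plus the latent adjustment, per coordinate."""
    return tuple(b + d for b, d in zip(box, delta))


mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
box = tight_bbox(mask)
print(box, segmentation_reference(box, (-5, -5, 5, 5)))
```

A negative adjustment on the first two coordinates together with a positive one on the last two enlarges the box, as in the blue boxes of Fig. 1.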

For the localization network, we optimize the model parameters by minimizing the Euclidean distance
between the output $F_{\omega^l}(I_i)$ and the segmentation reference
$\tilde{L}_i = L_i + \Delta L_i$. Thus the probability for $\omega^l$ and $\tilde{L}$ can be
represented as

$$P(\tilde{L}, I \mid \omega^l) = \frac{1}{Z} \prod_{i=1}^{N} \exp\left(-\lVert F_{\omega^l}(I_i) - L_i - \Delta L_i \rVert_2^2\right), \tag{2}$$

where $Z$ is a normalization term.

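For the segmentation network, we specify each neuron of the last layer as a binary classification
output. Then the parameters $\omega^s$ are estimated via logistic regression,

$$P(Y, I \mid \omega^s, \tilde{L}) = \prod_{i=1}^{N} \Bigg( \prod_{\{j \mid Y_i^j = 1\}} F^j_{\omega^s}(I_i, L_i + \Delta L_i) \cdot \prod_{\{j \mid Y_i^j = 0\}} \left(1 - F^j_{\omega^s}(I_i, L_i + \Delta L_i)\right) \Bigg), \tag{3}$$

where $F^j_{\omega^s}(I_i, L_i + \Delta L_i)$ is the jth element of the network output, given the
image $I_i$ and the segmentation reference $L_i + \Delta L_i$ as input.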

To optimize the model parameters, we solve the MLE objective by minimizing the following cost:

\begin{align}
J(\omega^l, \omega^s, \tilde{L}) &= -\frac{1}{N} \log P(Y, \tilde{L}, I \mid \omega^l, \omega^s) \tag{4} \\
&\propto \frac{1}{N} \sum_{i=1}^{N} \Big\{ \lVert F_{\omega^l}(I_i) - L_i - \Delta L_i \rVert_2^2 \tag{5} \\
&\qquad - \sum_{j} \Big[ Y_i^j \log F^j_{\omega^s}(I_i, L_i + \Delta L_i) + (1 - Y_i^j) \log\big(1 - F^j_{\omega^s}(I_i, L_i + \Delta L_i)\big) \Big] \Big\}, \tag{6}
\end{align}

where the first term (5) represents the cost for localization network training and the second term
(6) is the cost for segmentation network training.
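
The per-sample cost can be evaluated directly from network outputs. The sketch below is an illustration under stated assumptions: `joint_cost` and `mean_cost` are hypothetical names, the inputs are plain Python sequences rather than network tensors, and a small `eps` clamp is added for numerical safety (it is not part of the paper's formulation).

```python
import math
from typing import Sequence


def joint_cost(
    pred_box: Sequence[float],
    ref_box: Sequence[float],
    pred_probs: Sequence[float],
    target_mask: Sequence[int],
    eps: float = 1e-12,
) -> float:
    """Squared localization error plus pixelwise cross-entropy for one sample."""
    # Localization term: squared Euclidean distance to the segmentation reference.
    loc = sum((p - r) ** 2 for p, r in zip(pred_box, ref_box))
    # Segmentation term: negative log-likelihood of the binary mask.
    seg = -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
        for p, y in zip(pred_probs, target_mask)
    )
    return loc + seg


def mean_cost(samples) -> float:
    """Average of joint_cost over (pred_box, ref_box, pred_probs, mask) tuples."""
    return sum(joint_cost(*s) for s in samples) / len(samples)


example = ((0, 0, 10, 10), (1, 0, 10, 12), [0.5, 0.5], [1, 0])
print(joint_cost(*example))
```

For the example, the localization term is 1 + 4 = 5 and the segmentation term is 2 log 2, so both parts of the cost can be inspected separately when debugging training.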


4.2 Joint Task Optimization 


We propose an EM-type algorithm to optimize the learning cost $J(\omega^l, \omega^s, \tilde{L})$. As
Fig. 3 illustrates, it includes two iterative steps: (i) fixing the model parameters, apply
MCMC-based sampling to estimate the latent variables, which indicate the segmentation references
$\tilde{L}$; (ii) given the segmentation references, compute the model parameters of the two
networks jointly via back propagation. We explain these two steps in the following.

Figure 3: The EM-type learning algorithm with two steps: (i) K moves of MCMC sampling (gray
arrows), in which the latent variable $\Delta L_i$ is sampled considering both the localization
costs (indicated by the dashed gray arrow) and the segmentation costs; (ii) given the segmentation
reference and result after K moves of sampling, we apply back propagation (blue arrows) to estimate
the parameters of both networks.


(i) Latent variables estimation. Given a training image $I_i$ and the current model parameters, we
estimate the latent variable $\Delta L_i$. As there are no groundtruth labels for the latent
variables, it is intractable to estimate their distribution directly. It is also time-consuming to
enumerate all values of $\Delta L_i$ for evaluation, given the large search space. Thus we propose
an MCMC Metropolis-Hastings method [24] for latent variable sampling, which proceeds in K moves. In
each move, a new latent variable is sampled from the proposal distribution and is accepted with a
certain acceptance rate. For fast and effective search, we design the proposal distribution with a
data-driven term, based on the fact that segmentation boundaries are often aligned with the
boundaries of superpixels [1] generated by over-segmentation.

We first initialize the latent variable as $\Delta L_i = 0$. To find a better latent variable
$\Delta L_i'$ and achieve a reversible transition, we define the acceptance rate of the transition
from $\Delta L_i$ to $\Delta L_i'$ as

$$\alpha(\Delta L_i \to \Delta L_i') = \min\left(1, \frac{\pi(\Delta L_i') \cdot q(\Delta L_i' \to \Delta L_i)}{\pi(\Delta L_i) \cdot q(\Delta L_i \to \Delta L_i')}\right), \tag{7}$$

where $\pi(\Delta L_i)$ is the invariant distribution and $q(\Delta L_i \to \Delta L_i')$ is the
proposal distribution.


By replacing the dataset with a single sample in Eq. (1), we define the invariant distribution as
$\pi(\Delta L_i) = P(\omega^l, \omega^s, \tilde{L}_i \mid Y_i, I_i)$, which can be decomposed into
two probabilities: $P(\omega^l, \tilde{L}_i \mid Y_i, I_i)$ constrains the segmentation reference to
be close to the output of the localization network, and $P(\omega^s, \tilde{L}_i \mid Y_i, I_i)$
encourages a segmentation reference that contributes to a better segmentation mask. To calculate
these probabilities, we need to perform forward propagation in both networks.
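
The acceptance rule of Eq. (7) and the K-move sampling loop can be sketched as follows. This is a generic Metropolis-Hastings illustration, not the authors' code: the target `pi` and the proposal are supplied by the caller, `uniform_proposal` is a symmetric stand-in for the paper's Gaussian-times-data-driven proposal, the `DOMAIN` values are those listed in the experiment settings, and returning the best accepted state (rather than the last one) is an assumption about what "optimal latent variable" means.

```python
import random

DOMAIN = (-10, -5, 0, 5, 10)  # per-coordinate adjustments (from the experiment settings)


def acceptance_rate(pi_cur, pi_new, q_forward, q_reverse):
    """Eq. (7): min(1, pi(new) * q(new -> cur) / (pi(cur) * q(cur -> new)))."""
    denom = pi_cur * q_forward
    if denom <= 0.0:
        return 1.0  # zero mass at the current state: always move (assumed convention)
    return min(1.0, (pi_new * q_reverse) / denom)


def uniform_proposal(current, rng):
    """Symmetric stand-in proposal: redraw one coordinate uniformly from DOMAIN."""
    idx = rng.randrange(4)
    new = list(current)
    new[idx] = rng.choice(DOMAIN)
    return tuple(new)


def metropolis_hastings(pi, propose, q, k_moves=20, seed=0):
    """Run K moves starting from a zero adjustment; return the best accepted state."""
    rng = random.Random(seed)
    current = (0, 0, 0, 0)
    best = current
    for _ in range(k_moves):
        candidate = propose(current, rng)
        alpha = acceptance_rate(pi(current), pi(candidate),
                                q(current, candidate), q(candidate, current))
        if rng.random() < alpha:
            current = candidate
            if pi(current) > pi(best):
                best = current
    return best
```

Because `uniform_proposal` is symmetric, a constant function such as `lambda a, b: 1.0` can be passed as `q`; in the paper's setting, `pi` would require forward passes through both networks for every candidate.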


The proposal distribution is defined as a combination of a Gaussian distribution and a data-driven
term,

$$q(\Delta L_i \to \Delta L_i') = \mathcal{N}(\Delta L_i' \mid \mu_i, \Sigma_i) \cdot P_c(\Delta L_i' \mid Y_i, I_i), \tag{8}$$

where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix of the optimal $\Delta L_i'$
in the previous iterations. It is based on the observation that the current optimal $\Delta L_i'$
has a high probability of having been selected before. The data-driven term
$P_c(\Delta L_i' \mid Y_i, I_i)$ is computed from the given image $I_i$. After over-segmenting $I_i$
into superpixels, we define $V_j = 1$ if the jth image pixel is on the boundary of a superpixel and
$V_j = 0$ if it is inside a superpixel. We then sample c pixels at equal distances along the
segmentation reference $\tilde{L}_i' = L_i + \Delta L_i'$, and the data-driven term is represented
as $P_c(\Delta L_i' \mid Y_i, I_i) = \frac{1}{c} \sum_{j=1}^{c} V_j$. This term discourages bounding
box edges from cutting through possible foreground superpixels, which leads to more plausible
proposals. We set c = 200 in our experiments, and the over-segmentation into superpixels only needs
to be performed once, as pre-processing for training.
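
The data-driven term can be sketched as an average of superpixel-boundary indicators sampled along the box perimeter. This is a minimal illustration assuming a precomputed binary boundary map (1 on superpixel boundaries, 0 inside), clockwise sampling starting at the top-left corner, and rounding sample positions to the nearest pixel; none of these details are specified in the text, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def perimeter_points(box: Sequence[float], c: int) -> List[Tuple[float, float]]:
    """c points at equal arc-length spacing along the box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    total = 2 * (w + h)
    points = []
    for k in range(c):
        d = total * k / c
        if d < w:                      # top edge, left to right
            points.append((x1 + d, y1))
        elif d < w + h:                # right edge, top to bottom
            points.append((x2, y1 + (d - w)))
        elif d < 2 * w + h:            # bottom edge, right to left
            points.append((x2 - (d - w - h), y2))
        else:                          # left edge, bottom to top
            points.append((x1, y2 - (d - 2 * w - h)))
    return points


def data_driven_term(boundary: List[List[int]], box: Sequence[float], c: int = 200) -> float:
    """Fraction of the c perimeter samples that fall on superpixel boundaries."""
    height, width = len(boundary), len(boundary[0])
    hits = 0
    for x, y in perimeter_points(box, c):
        col = min(max(int(round(x)), 0), width - 1)   # clamp to the image
        row = min(max(int(round(y)), 0), height - 1)
        hits += boundary[row][col]
    return hits / c
```

A box whose edges follow superpixel boundaries scores close to 1, while a box that slices through superpixel interiors scores close to 0.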

(ii) Model parameters estimation. As shown in Fig. 3, given the optimal latent variable
$\Delta L$ after K moves of sampling, we can obtain the corresponding segmentation references
$\tilde{L}$ and the segmentation results. Then the parameters $\omega^s$ of the segmentation network
are optimized via back propagation with logistic regression (the second term, Eq. (6), of the cost
in Eq. (4)), and the parameters $\omega^l$ of the localization network are tuned by minimizing the
squared error between the segmentation references and the localization output (the first term,
Eq. (5), of the cost in Eq. (4)).
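
The alternation between steps (i) and (ii) can be summarized in a short training skeleton. This is a structural sketch only: `estimate_latent` and `update_networks` are hypothetical callables standing in for the MCMC sampler and the back-propagation update, each sample is assumed to be a dictionary with a `"box"` entry holding its tight bounding box, and running step (i) over all samples before each update is an assumption (the paper does not state the batching scheme).

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, ...]


def em_train(
    samples: List[Dict],
    estimate_latent: Callable[[Dict], Sequence[int]],
    update_networks: Callable[[List[Dict], List[Box]], None],
    n_iterations: int,
) -> List[Box]:
    """Alternate latent-variable estimation and network updates."""
    # Zero adjustment initially: the references start at the tight boxes.
    references = [tuple(s["box"]) for s in samples]
    for _ in range(n_iterations):
        # Step (i): network parameters fixed, estimate the adjustment per sample.
        deltas = [estimate_latent(s) for s in samples]
        references = [
            tuple(b + d for b, d in zip(s["box"], delta))
            for s, delta in zip(samples, deltas)
        ]
        # Step (ii): references fixed, update both networks by back propagation.
        update_networks(samples, references)
    return references
```

Setting `estimate_latent` to always return a zero adjustment reduces this loop to training the two networks separately.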

During back propagation, we apply stochastic gradient descent to update the model parameters.
For the segmentation network, we use an equal learning rate $\epsilon_1$ for all layers. For
localization, we first pre-train the network discriminatively for classifying 1000 object categories
in the ImageNet dataset [?]. With the pre-training, we can borrow the information learned from a
large dataset to improve our performance. We keep the parameters of the convolutional layers and
randomly re-initialize the parameters of the full connection layers. The learning rate is set to
$\epsilon_2$ for the full connection layers and $\epsilon_2 / 100$ for the convolutional layers.
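
The layer-wise learning rates can be written down as a small configuration plus a plain SGD step. The group names (`"seg_all"`, `"loc_fc"`, `"loc_conv"`) and the helper `sgd_step` are illustrative assumptions; the numeric values of the two base rates are the ones reported in the experiment settings, and momentum or weight decay are omitted because the text does not mention them.

```python
EPS_1 = 1.0e-6   # segmentation network, all layers (value from the experiment settings)
EPS_2 = 1.0e-8   # localization network, full connection layers

LEARNING_RATES = {
    "seg_all": EPS_1,
    "loc_fc": EPS_2,
    "loc_conv": EPS_2 / 100,   # pre-trained convolutional layers move more slowly
}


def sgd_step(params, grads, lr):
    """One plain stochastic gradient descent update: p <- p - lr * g."""
    return [p - lr * g for p, g in zip(params, grads)]


print(LEARNING_RATES)
print(sgd_step([1.0, 2.0], [10.0, -10.0], 0.1))
```

The 100-fold smaller rate on the convolutional layers keeps the pre-trained features nearly fixed while the re-initialized full connection layers adapt to the bounding box regression.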

5 Experiment 

We validate our approach on the Saliency dataset and on a more challenging dataset newly collected
by us, namely the Object Extraction (OE) dataset¹. We compare our approach with state-of-the-art
methods, and empirical analyses are also presented.

The Saliency dataset is a combination of the THUR15000 [?] and THUS10000 [9] datasets, which
includes 16233 images with pixelwise groundtruth masks. Most of the images contain one salient
object, and we do not utilize the category information in training and testing. We randomly split
the dataset into 14233 images for training and 2000 images for testing. The OE dataset collected by
us is more comprehensive, including 10183 images with groundtruth masks. We select the images from
the PASCAL [?], iCoseg [?] and Internet [28] datasets, as well as other data from the web (mostly
images of people and clothes). Compared to traditional segmentation datasets, the salient objects in
the OE dataset vary more in appearance and shape (or pose), and they often appear in complicated
scenes with background clutter. For the evaluation on the OE dataset, 8230 samples are randomly
selected for training and the remaining 1953 are used for testing.

Experiment Settings. During training, the domain of each element in the 4-dimensional latent
variable vector $\Delta L_i$ is set to {-10, -5, 0, 5, 10}; thus there are $5^4 = 625$ possible
proposals for each $\Delta L_i$. We set the number of MCMC sampling moves to K = 20 during
searching. The learning rate is $\epsilon_1 = 1.0 \times 10^{-6}$ for the segmentation network and
$\epsilon_2 = 1.0 \times 10^{-8}$ for the localization network. For testing, as each pixelwise
output of our method is well separated, taking values close to either 1 or 0, we simply classify it
as foreground or background with a threshold of 0.5. The experiments are performed on a desktop
with an Intel i7 3.7GHz CPU, 16GB RAM and a GTX TITAN GPU.
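
The size of the latent search space follows directly from the stated domain, as the short enumeration below shows. The names `DOMAIN`, `K_MOVES` and `proposals` are illustrative; the ratio at the end assumes that the cost of evaluating one candidate (a forward pass through both networks) is the same for sampling and for enumeration.

```python
from itertools import product

DOMAIN = (-10, -5, 0, 5, 10)   # values allowed for each of the 4 box coordinates
K_MOVES = 20                   # MCMC moves per image

# Every possible 4-dimensional adjustment vector.
proposals = list(product(DOMAIN, repeat=4))

print(len(proposals))            # 625
print(len(proposals) / K_MOVES)  # 31.25
```

Under that assumption, exhaustive enumeration needs about 31 times as many candidate evaluations per image as K = 20 sampling moves.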

5.1 Results and Comparisons 

We now quantitatively evaluate the performance of our method. As evaluation metrics, we adopt the
Precision, P (the average percentage of pixels that are correctly labeled as either foreground or
background), and the Jaccard similarity, J (the average intersection-over-union score
$\frac{|S \cap G|}{|S \cup G|}$, where S is the set of foreground pixels obtained via our algorithm
and G is the set of groundtruth foreground pixels). We then compare the results of our approach with
machine learning based methods such as figure-ground segmentation [22], CPMC [6] and Object
Proposals [13]. As CPMC and Object Proposals generate multiple ranked segments intended to cover
objects, we follow the process applied in [22] to evaluate their results: we use the union of the
top $\mathcal{K}$ ranked segments as the salient object prediction.
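
Both metrics are straightforward to compute for a single image. The sketch below assumes binary masks given as nested lists of 0/1 values with identical shapes; the function names are illustrative, and returning 1.0 for two empty masks is a convention chosen here, not one stated in the paper. Dataset-level scores would be the averages of these per-image values.

```python
from typing import List

Mask = List[List[int]]


def pixel_precision(pred: Mask, gt: Mask) -> float:
    """Fraction of pixels whose label (foreground or background) matches the groundtruth."""
    total = sum(len(row) for row in gt)
    correct = sum(p == g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return correct / total


def jaccard(pred: Mask, gt: Mask) -> float:
    """Intersection-over-union of the predicted and groundtruth foreground."""
    pairs = [(p, g) for pr, gr in zip(pred, gt) for p, g in zip(pr, gr)]
    inter = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    return inter / union if union else 1.0   # both masks empty: treated as a perfect match


pred = [[1, 1, 0], [0, 0, 0]]
gt = [[1, 0, 0], [1, 0, 0]]
print(pixel_precision(pred, gt), jaccard(pred, gt))
```

The example shows why J is the stricter metric: four of six pixels are labeled correctly, yet the foreground overlap is only one pixel out of a union of three.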

1 http://vision.sysu.edu.cn/projects/deep-joint-task-learning/ 
     Ours(full)  Ours(sim)  FgSeg [22]  CPMC [6]  ObjProp [13]  HS [32]  GC [10]  RC [9]  HC [9]
P    97.81       96.62      91.92       83.64     72.60         89.99    89.23    90.16   89.24
J    87.02       81.10      70.85       56.14     54.12         64.72    58.30    63.69   58.42

Table 1: Evaluation on the Saliency dataset with Precision (P) and Jaccard similarity (J), in
percent. Ours(full) indicates our joint learning method and Ours(sim) means learning the two
networks separately.


     Ours(full)  Ours(sim)  FgSeg [22]  CPMC [6]  ObjProp [13]  HS [32]  GC [10]  RC [9]  HC [9]
P    93.12       91.25      90.42       76.33     72.14         87.42    85.53    86.25   83.37
J    77.69       71.50      70.93       53.76     54.70         62.83    54.83    59.34   50.61

Table 2: Evaluation on the OE dataset with Precision (P) and Jaccard similarity (J), in percent.
Ours(full) indicates our joint learning method and Ours(sim) means learning the two networks
separately.


We evaluate the performance for all $\mathcal{K} \in \{1, \ldots, 100\}$ and report the best result
for each sample in our experiment. Besides machine learning based methods, we also report the
results of salient region detection methods [9, 10, 32]. Note that two approaches are described in
[9], utilizing histogram based contrast (HC) and region based contrast (RC), respectively. Given the
saliency maps from these methods, the iterative GrabCut proposed in [9] is used to generate binary
segmentation results.

Saliency dataset. We report the experimental results on this dataset in Table 1. Our result with
joint task learning (namely Ours(full)) reaches 97.81% in Precision (P) and 87.02% in Jaccard
similarity (J). Compared to the figure-ground segmentation method [22], we obtain improvements of
5.89% in P and 16.17% in J. Among the saliency region detection methods, the best results are
P: 89.99% and J: 64.72%, achieved by [32]. Our method demonstrates superior performance compared to
these approaches.

OE dataset. The evaluation of our method on the OE dataset is shown in Table 2. By jointly learning
the localization and segmentation networks, our approach reaches 93.12% in P and 77.69% in J, the
highest performance among the compared state-of-the-art methods.

One highlight of our work is its high efficiency in testing. As Table 3 illustrates, the average
time for object extraction from an image with our method is 0.014 seconds, while figure-ground
segmentation [22] requires 94.3 seconds, CPMC [6] requires 59.6 seconds and Object Proposals [13]
requires 37.4 seconds. For most of the saliency region detection methods, the runtime is dominated
by the iterative GrabCut process; thus we take its time, 0.711 seconds, as the average testing time
of the saliency region detection methods. As a result, our approach is roughly 50 to 6000 times
faster than the state-of-the-art methods.
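
The speed-up range can be reproduced from the per-image times quoted above. The dictionary keys are just labels for this sketch; the timings are the ones reported in the text and in Table 3.

```python
OURS_SECONDS = 0.014

BASELINE_SECONDS = {
    "FgSeg [22]": 94.3,
    "CPMC [6]": 59.6,
    "ObjProp [13]": 37.4,
    "Saliency methods": 0.711,
}

# Speed-up factor of our method over each baseline.
speedups = {name: t / OURS_SECONDS for name, t in BASELINE_SECONDS.items()}

for name, factor in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{name}: {factor:.0f}x")
```

The smallest factor (about 51, against the GrabCut-based saliency methods) and the largest (about 6700, against figure-ground segmentation) bracket the range stated in the text.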

Training requires around 20 hours to converge on the Saliency dataset and 13 hours on the OE
dataset. For latent variable sampling, we also tried enumerating the 625 possible proposals
exhaustively for each image. This achieves accuracy similar to our approach, but costs about 30
times the runtime in each training iteration.

5.2 Empirical Analysis 

For further evaluation, we conduct the following two empirical analyses to illustrate the
effectiveness of our method.

(I) To clarify the significance of joint learning compared to learning the two networks separately,
we discard the latent variable sampling and set all $\Delta L_i = 0$ during training; we refer to
this variant as Ours(sim). We illustrate the training cost $J(\omega^l, \omega^s, \tilde{L})$
(Eq. (4)) for these two methods in Fig. 4. We plot the average loss over all training samples
through the training iterations, and it is shown that our joint learning


        Ours(full)   FgSeg [22]   CPMC [6]   ObjProp [13]   Saliency methods
Time    0.014s       94.3s        59.6s      37.4s          0.711s

Table 3: Testing time for each image. "Saliency methods" indicates the saliency region detection
methods [9, 10, 32].
                          Car             Horse           Airplane
                          P      J        P      J        P      J
Ours (full)               87.95  68.86    88.11  53.80    92.12  60.10
Chen et al. [7]           87.09  64.67    89.00  57.58    90.24  59.97
Rubinstein et al. [28]    83.38  63.36    83.69  53.89    86.14  55.62

Table 4: We compare our method with two object discovery and segmentation methods on the
Internet dataset. We train our model on data other than the images in the Internet dataset.


method achieves lower costs than the one without latent variable adjustment. We also compare
these two methods in terms of Precision and Jaccard similarity on both datasets. As Table 1
illustrates, learning the two networks jointly yields improvements of 1.19% in P and 5.92% in J on
the Saliency dataset. On the OE dataset, joint learning performs 1.87% higher in P and 6.19%
higher in J than learning the two networks separately, as shown in Table 2.

(II) We demonstrate that our method generalizes well across different datasets. Given the OE
dataset, we train our model on all the data except the images collected from the Internet dataset
[28]. The newly trained model is then tested on the Internet dataset. We compare the performance of
this deep model with two object discovery and co-segmentation methods [28, 7] on the Internet
dataset. As Table 4 illustrates, our method achieves higher performance on the Car and Airplane
classes, and a comparable result on the Horse class. Thus our model generalizes well to datasets
that are not used in training and achieves state-of-the-art performance. It is also worth
mentioning that the co-segmentation methods [28, 7] require a few seconds per image for testing,
which is much slower than our approach at 0.014 seconds per image.




[Figure 4 plots: training cost versus Training Iterations; panel (a) spans 0 to 120000 iterations, panel (b) spans 0 to 60000 iterations.]

Figure 4: The training cost across iterations. The cost is evaluated over all the training samples in 
each dataset:(a) Saliency dataset;(b) OE dataset. 

6 Conclusion 

This paper studies joint task learning via deep neural networks for generic object extraction, in
which two networks work collaboratively to boost performance. Our joint deep model has been shown to
handle realistic data from the internet well. More importantly, the approach for extracting the
object segmentation mask from an image is very efficient, running about 1000 times faster than
competing state-of-the-art methods. The proposed framework can be extended to handle other joint
tasks in similar ways.